Querying and Ranking XML Documents Based on Data Synopses

Authors

Weimin He

Teng Lv

Abstract

Interest in querying and ranking XML documents has grown in recent years. In this paper, we present a new framework for querying and ranking schema-less XML documents based on concise summaries of their structural and textual content. We introduce a novel data synopsis structure that summarizes the textual content of an XML document for efficient indexing. More importantly, we extend the traditional vector space model to rank XML documents effectively over the proposed data synopses. We conduct extensive experiments over XML benchmark data to demonstrate the advantages of the indexing scheme and the effectiveness of our ranking scheme. We also compare our framework with Lucene to demonstrate that our extended TF*IDF scoring function is effective.
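For orientation, the traditional vector space model that the abstract says it extends can be sketched as below. This is a minimal illustration of plain TF*IDF weighting with cosine similarity over flat text, not the authors' extended scoring function and not Lucene's; the function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `rank`) and the smoothed IDF formula are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_idf(docs):
    # docs is a list of token lists; IDF here is log(N / df) + 1 (an assumed variant).
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector: raw term frequency times IDF; terms unseen in the corpus are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, texts):
    # Return document indices ordered by descending similarity to the query.
    docs = [tokenize(t) for t in texts]
    idf = build_idf(docs)
    q = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]
```

A call such as `rank("xml ranking", texts)` returns the indices of `texts` from best to worst match. Per the abstract, the framework described in the paper scores documents over data synopses rather than over raw text as this sketch does.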

Related resources

Locating XML Documents Using Content and Structure Synopses

In this paper, we present a novel framework for locating schema-less XML documents based on concise data synopses extracted from the documents. We introduce two novel data synopses, content synopsis and positional filter, to summarize the text data in an XML document for query evaluation. These two data synopses correlate textual with positional information and consider the containment rela...

A synopsis based approach for XML fast approximate querying

XML was created to represent, exchange and publish information on the Web, but it has since spread to many other applications. Due to this success, the W3C has proposed a new query language, XQuery, specifically designed to query XML data. XQuery returns exact answers to queries; however when applied to large XML repositories or warehouses, such precise queries may require high response tim...

Indexing and Searching XML Documents Based on Content and Structure Synopses

We present a novel framework for indexing and searching schema-less XML documents based on concise summaries of their structural and textual content. Our search query language is XPath extended with full-text search. We introduce two novel data synopsis structures that correlate textual with positional information in an XML document and improve query precision. In addition, we present a two-ph...

Web Retrieval of XML Documents: Practice and Challenges

The Web is characterized by a huge number of very heterogeneous data sources that differ in both media support and format representation. In this scenario, there is a need for an integrated approach to querying heterogeneous Web documents. To this end, XML can play an important role since it is becoming a standard for data representation and exchange over the Web. Due to its flexibility, XM...
